We present a global algorithm for training multilayer neural networks in this Letter. The algorithm 
is focused on controlling the local fields of neurons induced by the input of samples by random 
adaptations of the synaptic weights. Unlike the backpropagation algorithm, the networks may have 
discrete-state weights, and may apply either differentiable or nondifferentiable neural transfer functions. 
A two-layer network is trained as an example to separate a linearly inseparable set of samples 
into two categories, and its powerful generalization capacity is emphasized. The extension to more 
general cases is straightforward. 

The multilayer neural network, trained by the backpropagation (BP) algorithm, is currently the most widely used 
neural network since it can solve linearly inseparable classification problems [1,2]. The BP algorithm is responsible 
for the rebirth of neural networks. 

However, the BP algorithm has several limitations. Firstly, it requires the neural transfer functions to be differentiable 
in order to calculate the derivatives with respect to the synaptic weights. Secondly, the performance index 
to be minimized is invariably the mean square error, because a non-quadratic performance index may result in 
a very complex performance surface. Finally, the synaptic weights obtained by this algorithm are continuous as a 
consequence of its weight-update equation. These limitations are also inherent in variations of the BP 
algorithm. The last is also a limitation of most learning rules for training single-layer neural networks. 
Discrete synaptic states are not only advantageous for digital hardware realization but also an experimental reality: 
recent experiments have shown that the synaptic states in certain real neural network systems may be discrete [3-5]. 
The BP algorithm is a local learning rule. Most of the learning rules for training single-layer neural networks, such 
as the perceptron rule, the Hebb rule, and the Widrow-Hoff rule, are local rules. When training a network using 
a local rule, one inputs the samples into the network one by one, and each time the synaptic weights are updated 
independently of the other samples. An update of the synaptic weights induced by the input of a sample is an optimal 
solution for this sample, but not for other samples. 
In principle, it is more favorable if each step of the update is an optimal solution for all the samples. This requires 
the consideration of the whole set of samples globally. An influential example of global rules is the Pseudoinverse 
rule [1,6] used for training single-layer networks. 
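
As a minimal illustration of a global rule, the pseudoinverse rule can be sketched as follows. The sketch assumes a single-layer network with linear outputs; the function name `pseudoinverse_rule` and the toy data are illustrative choices and are not taken from Refs. [1,6].

```python
import numpy as np

def pseudoinverse_rule(samples, targets):
    """Return weights J such that J @ samples[mu] matches targets[mu] for all mu at once.

    samples: array of shape (M, N), one sample per row.
    targets: array of shape (M, K), desired outputs per sample.
    """
    # All samples enter a single least-squares solution; there are no
    # sample-by-sample updates as in a local rule.
    return (np.linalg.pinv(samples) @ targets).T

rng = np.random.default_rng(0)
xi = rng.choice([-1.0, 1.0], size=(5, 20))   # M = 5 samples, N = 20 inputs
t = rng.choice([-1.0, 1.0], size=(5, 1))     # one binary target per sample
J = pseudoinverse_rule(xi, t)                # weight matrix of shape (1, 20)
outputs = xi @ J.T                           # linear outputs for all samples
```

Because the five samples are linearly independent here, the linear outputs reproduce the targets for every sample simultaneously, which is the sense in which the rule is global.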

One of the present authors has recently proposed another global learning rule, called the Monte Carlo adaptation 
algorithm (abbreviated as the MCA algorithm hereafter) [7]. The basic idea is to make an adaptation to a randomly chosen 
synaptic weight, and accept the adaptation if it improves the network performance globally. A realization of the MCA 
algorithm has been used to train single-layer feedback neural networks with binary-state synaptic weights [7,8]. 

The purpose of this Letter is to present a general version of the MCA algorithm which is applicable to training multilayer 
neural networks with either continuous or discrete synaptic states, and with either differentiable or nondifferentiable 
neural transfer functions. Based on the observation that the network performance is determined by the local fields 
of neurons induced by the input of the samples (abbreviated as LFNIIS), our algorithm is focused on controlling the 
distributions of the LFNIIS by continuously adapting the synaptic weights. Two steps are applied to perform the control. 
The first one is to determine the target distribution of the LFNIIS, define the states of the synaptic weights, and choose 
the transfer function of the neurons for each layer respectively. The second one is to randomly select a synaptic weight, 
randomly adapt it to a new state, and then decide whether or not to accept this adaptation by a criterion. 
This step is repeated until the distributions overlap with the target ones. The criterion for acceptable adaptations is 
crucial for the algorithm. In principle, we accept an adaptation if the distributions of the LFNIIS induced by it do not 
diverge from the target distributions statistically. This guarantees that the distributions of the LFNIIS evolve towards 
the targets in a one-way manner. 
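
The two-step framework above can be sketched generically as follows. This is only an illustration under stated assumptions: the callable `score` is a hypothetical stand-in for "closeness of the LFNIIS distributions to their targets", and accepting an adaptation whenever the score does not decrease is one simple choice of criterion, not necessarily the one used in this Letter.

```python
import random

def mca_train(weights, states, score, max_steps, seed=0):
    """Generic Monte Carlo adaptation loop (illustrative sketch).

    weights: mutable list of synaptic weights.
    states: allowed weight states (at least two).
    score: callable returning a number to be maximised (hypothetical).
    """
    rng = random.Random(seed)
    best = score(weights)
    for _ in range(max_steps):
        idx = rng.randrange(len(weights))            # randomly select a weight
        old = weights[idx]
        weights[idx] = rng.choice([s for s in states if s != old])
        new = score(weights)
        if new >= best:
            best = new                               # accept the adaptation
        else:
            weights[idx] = old                       # reject and restore
    return weights, best

# Toy usage: a single neuron with binary weights and random binary samples.
rng = random.Random(1)
N, M = 15, 10
samples = [[rng.choice([-1, 1]) for _ in range(N)] for _ in range(M)]
labels = [rng.choice([-1, 1]) for _ in range(M)]

def n_correct(w):
    """Number of samples whose local field has the desired sign."""
    return sum(1 for x, t in zip(samples, labels)
               if t * sum(wi * xi for wi, xi in zip(w, x)) > 0)

w0 = [rng.choice([-1, 1]) for _ in range(N)]
initial = n_correct(w0)
w, final = mca_train(list(w0), [-1, 1], n_correct, max_steps=2000)
```

Since rejected adaptations are undone, the score can never decrease during training, which mirrors the one-way evolution described above.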

As a realization of the above framework, we train a two-layer neural network to separate a set of linearly 
inseparable samples into two categories. In order to demonstrate that the network not only overcomes the limitations of 
BP networks but also improves the network performance, we emphasize its powerful generalization capacity over a 
two-layer BP network. The generalization capacity is essential for a neural network. This is because the sample set is 
normally representative of a much larger class of patterns. It is particularly important that the network successfully 
generalizes what it has learned to the total population [1,9,10]. In our example, when we input an unlearned pattern 
having a high similarity with one sample, it is naturally desirable that the network classifies it into the same 
category that the sample belongs to. The degree of the average divergence of the pattern from the sample below 
which the network can correctly categorize it measures the generalization capacity. 

Suppose there are $M$ samples available for the training, and the $\mu$th sample is represented by an $N$-dimensional 
binary vector $\xi^{\mu} = \{\xi_i^{\mu}, i = 1, \dots, N\}$ with $\xi_i^{\mu} = \pm 1$. It has been proved [11] that at most $M < 2N$ samples 
can be linearly separated if the samples have no correlations, and a single-layer neural network 
trained by the perceptron rule can fulfil this task. To solve linearly inseparable classification problems, one has to 
apply multilayer neural networks. 

Let $J^{(l)}$ represent the weight matrix and $v^{(l-1)}$ represent the input vector of the $l$th layer of a multilayer neural 
network. The output of the $l$th layer is determined by the equations: 

$$h_i^{(l)} = \sum_{j=1}^{N^{(l-1)}} J_{ij}^{(l)} v_j^{(l-1)}, \qquad (1)$$

$$v_i^{(l)} = \sigma^{(l)}(h_i^{(l)}), \qquad (2)$$

where $h_i^{(l)}$ is the local field of the $i$th neuron in the $l$th layer and $\sigma^{(l)}$ is its transfer function. In the equations, $N^{(l)}$ 
represents the number of neurons in the $l$th layer with $N^{(0)} = N$. 

A two-layer neural network has a hidden layer and an output layer. When the $\mu$th sample $\xi^{\mu}$ is input into the 
network one obtains the local field, $h_{i\mu}^{(1)} = \sum_{j=1}^{N} J_{ij}^{(1)} \xi_j^{\mu}$, and the output, $v_{i\mu}^{(1)} = \sigma^{(1)}(h_{i\mu}^{(1)})$, of the $i$th neuron in the 
hidden layer. 

For the output layer, because we want the network to separate the samples into only two categories, one neuron in 
this layer is enough. In this case, the weight matrix will be a $1 \times N^{(1)}$ matrix, whose elements will be denoted by $J_j^{(2)}$. 
Here $N^{(1)}$ is the number of neurons in the hidden layer. Inputting the vector $v_{\mu}^{(1)}$ to the output layer one obtains the 
local field, $h_{\mu}^{(2)} = \sum_{j=1}^{N^{(1)}} J_j^{(2)} v_{j\mu}^{(1)}$, and the output, $v_{\mu}^{(2)} = \sigma^{(2)}(h_{\mu}^{(2)})$, of the neuron. 
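
The forward pass of such a two-layer network can be sketched as follows. The sketch assumes step transfer functions in both layers and maps a vanishing local field to $-1$, a case the text leaves undefined; the names `step` and `forward` and the tiny example are illustrative.

```python
import numpy as np

def step(x):
    """Step transfer function: +1 for x > 0, -1 otherwise (x = 0 is an assumption)."""
    return np.where(x > 0, 1, -1)

def forward(J1, J2, xi):
    """Local fields and outputs of a two-layer network for all samples.

    J1: (N1, N) hidden-layer weights, J2: (N1,) output-layer weights,
    xi: (M, N) samples, one per row.
    """
    h1 = xi @ J1.T        # hidden-layer local fields, shape (M, N1)
    v1 = step(h1)         # hidden-layer outputs
    h2 = v1 @ J2          # output-layer local field, shape (M,)
    v2 = step(h2)         # network output
    return h1, v1, h2, v2

J1 = np.array([[1, 1, 1], [1, -1, 1], [-1, -1, -1]])
J2 = np.array([1, 1, 1])
xi = np.array([[1, -1, 1]])
h1, v1, h2, v2 = forward(J1, J2, xi)
```

For this single sample the hidden local fields are 1, 3 and -1, so the hidden outputs are 1, 1 and -1 and the output neuron receives a local field of 1.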

Let $\Sigma_1$ and $\Sigma_2$ represent the two categories of samples. Our goal is to establish the connections, $v_{\mu}^{(2)} = 1$ if $\xi^{\mu} \in \Sigma_1$ 
and $v_{\mu}^{(2)} = -1$ if $\xi^{\mu} \in \Sigma_2$, by the proper solution of $J^{(1)}$ and $J^{(2)}$. To fulfil this goal, the transfer function of the 
neuron in the output layer must be the step function: $\sigma^{(2)}(x) = 1$ for $x > 0$ and $\sigma^{(2)}(x) = -1$ for $x < 0$. 

The establishment of the connections implies the satisfaction of the condition $t_{\mu} h_{\mu}^{(2)} > 0$ in terms of the local fields, 
where $t_{\mu} = 1$ for $\xi^{\mu} \in \Sigma_1$ and $t_{\mu} = -1$ for $\xi^{\mu} \in \Sigma_2$. However, this is not enough for the generalization. When 
inputting a vector which has a set of elements, denoted by $\{k\}$, different from, say, the $\mu$th sample, the local 
fields of the neurons in the hidden layer induced by this input are $\tilde{h}_{i\mu}^{(1)} = h_{i\mu}^{(1)} - 2 \sum_{k} J_{ik}^{(1)} \xi_k^{\mu}$, where the sum 
is over the set $\{k\}$. This in turn results in a set of elements, denoted by $\{k'\}$, different from $v_{\mu}^{(1)}$, and leads to 
$\tilde{h}_{\mu}^{(2)} = h_{\mu}^{(2)} - 2 \sum_{k'} J_{k'}^{(2)} v_{k'\mu}^{(1)}$ for the neuron in the output layer, where the sum is over $\{k'\}$. The generalization capacity 
is thus determined by the capability of conserving the signs of $h_{i\mu}^{(1)}$ and $h_{\mu}^{(2)}$ under as many mutations as possible of 
the sample, which requires the absolute values of not only $h_{\mu}^{(2)}$ but also $h_{i\mu}^{(1)}$ to be as big as possible. 
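
The relation between the perturbed and the original hidden-layer local fields can be checked numerically. The following sketch assumes binary inputs and weights; the sizes and variable names are illustrative choices and are not taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
N, N1 = 21, 7
J1 = rng.choice([-1, 1], size=(N1, N))     # hidden-layer weights
xi = rng.choice([-1, 1], size=N)           # one binary sample

k = rng.choice(N, size=4, replace=False)   # indices {k} of the flipped elements
x_new = xi.copy()
x_new[k] *= -1                             # flipping changes each element by -2 * xi_k

h_direct = J1 @ x_new                          # local fields of the perturbed input
h_formula = J1 @ xi - 2 * J1[:, k] @ xi[k]     # original fields minus the correction
```

Both ways of computing the perturbed fields agree exactly, so a sign is conserved whenever the absolute value of the original local field exceeds the magnitude of the correction term.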

Thus, to gain better generalization capacity we should expect the distribution of $h_{\mu}^{(2)}$ to satisfy the condition $t_{\mu} h_{\mu}^{(2)} > c$, 
where $c$ is a positive parameter. For the hidden layer, there is no restriction on the sign of a specific $h_{i\mu}^{(1)}$; we thus 
define $d_1 = \sum_{\mu} \sum_{i=1}^{N^{(1)}} |h_{i\mu}^{(1)}|$ to roughly measure the mean absolute value of the local fields. To gain better generalization 
capacity we expect $d_1$ to be as large as possible. 

We apply the following procedure to train the network to find a set of synaptic weights that guarantees 
that the desired distributions of $h_{i\mu}^{(1)}$ and $h_{\mu}^{(2)}$ are satisfied. 

(1) Initialize $J_{ij}^{(l)}$ with $J_{ij}^{(l)} \in \{\theta_k^{(l)}, k = 1, \dots, p\}$ randomly with equal probability; calculate $h_{i\mu}^{(1)}$, $h_{\mu}^{(2)}$, and $d_1$. 
Here $\theta_k^{(l)}$ is a state of $J_{ij}^{(l)}$. 

(2) Randomly select a $J_{ij}^{(l)}$ and randomly adapt it to a new state $\theta_{k'}^{(l)}$; if $l = 1$ calculate 
if $l = 2$ calculate 




Then calculate 
$v_{i\mu}^{(1)}$, etc., otherwise retain the old ones; 



return to step (2) until the condition $t_{\mu} h_{\mu}^{(2)} > c$ is achieved and $d_1$ cannot be further enlarged. 
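
A possible realization of this kind of procedure is sketched below for the hidden-layer weights only, with the output-layer weights held fixed. The acceptance test is an assumption made for illustration: an adaptation is kept if the number of samples violating $t_{\mu} h_{\mu}^{(2)} > c$ does not increase and $d_1$ does not decrease. The function name, the toy sizes and the fixed number of steps are likewise hypothetical.

```python
import numpy as np

def step(x):
    return np.where(x > 0, 1, -1)

def train_hidden_layer(xi, t, J1, J2, c, states=(-1, 1), max_steps=20000, seed=0):
    """Monte Carlo adaptation of hidden-layer weights (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    J1 = J1.copy()
    h1 = xi @ J1.T                           # hidden local fields, shape (M, N1)
    h2 = step(h1) @ J2                       # output local fields, shape (M,)
    violations = int(np.sum(t * h2 <= c))    # samples not yet satisfying t*h2 > c
    d1 = float(np.abs(h1).sum())
    for _ in range(max_steps):
        i = rng.integers(J1.shape[0])        # randomly select a weight J1[i, j]
        j = rng.integers(J1.shape[1])
        old = J1[i, j]
        new = rng.choice([s for s in states if s != old])
        # Only the fields of hidden neuron i change; update them incrementally.
        col = h1[:, i] + (new - old) * xi[:, j]
        h2_new = h2 + J2[i] * (step(col) - step(h1[:, i]))
        violations_new = int(np.sum(t * h2_new <= c))
        d1_new = d1 + float(np.abs(col).sum() - np.abs(h1[:, i]).sum())
        # Assumed criterion: never move away from either target.
        if violations_new <= violations and d1_new >= d1:
            J1[i, j] = new
            h1[:, i] = col
            h2, violations, d1 = h2_new, violations_new, d1_new
    return J1, h1, h2, violations, d1

rng = np.random.default_rng(1)
N, N1, M, c = 31, 15, 20, 2
xi = rng.choice([-1, 1], size=(M, N))
t = np.where(np.arange(M) < M // 2, 1, -1)   # first half in one category
J1_init = rng.choice([-1, 1], size=(N1, N))
J2 = rng.choice([-1, 1], size=N1)
J1, h1, h2, violations, d1 = train_hidden_layer(xi, t, J1_init, J2, c, max_steps=5000)
```

The incremental update avoids recomputing all local fields after each trial adaptation, since changing one hidden-layer weight affects only one column of the hidden local fields.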

With a set of parameters $N = 1000$, $N^{(1)} = 1000$, $M = 2400$, $c = 30$, and applying the binary weights $J_{ij}^{(l)} \in \{+1, -1\}$ 
while adopting the step transfer function for each neuron, we tested the above training procedure by 
separating the samples into two sets of equal size. Without loss of generality, we suppose $\xi^{\mu} \in \Sigma_1$ for 
$\mu = 1, \dots, M/2$ and $\xi^{\mu} \in \Sigma_2$ for $\mu = M/2 + 1, \dots, M$. Note that the samples are linearly inseparable since $M > 2N$. 

In Fig. 1(a) and 1(b) the up-triangles show the distributions of $h_{i\mu}^{(1)}$ and $h_{\mu}^{(2)}$ respectively. It can be seen that $h_{\mu}^{(2)}$ 
distributes in the region of $t_{\mu} h_{\mu}^{(2)} > 30$ correctly, and the distribution of $h_{i\mu}^{(1)}$ shows a two-peak structure. 

The two-peak structure is a consequence of controlling the distribution of $h_{i\mu}^{(1)}$ by restricting $\tilde{d}_1 > d_1$ for acceptable 
adaptations, i.e., by requiring the value of $d_1$ after an adaptation to exceed its value before. If one merely employs $n > 0$ as the 
criterion for acceptable adaptations, the distribution of $h_{\mu}^{(2)}$ can fulfil the 
condition $t_{\mu} h_{\mu}^{(2)} > 30$ easily. However, the distribution of $h_{i\mu}^{(1)}$ will be out of control. The open stars in Fig. 1(a) and 
1(b) show the distributions of $h_{i\mu}^{(1)}$ and $h_{\mu}^{(2)}$ respectively. It can be seen that $h_{i\mu}^{(1)}$ distributes around the origin with 
a single-peak structure, while the distribution of $h_{\mu}^{(2)}$ is similar to that obtained by using $n > 0$ and $\tilde{d}_1 > d_1$ as the 
criterion. 

For the generalization, the distribution of $h_{i\mu}^{(1)}$ with the two-peak structure is obviously preferable to that with 
the single-peak structure, since the number of elements of $h_{i\mu}^{(1)}$ with small absolute values in the former case is much 
smaller than that in the latter case. Figure 2 confirms this prediction. In the figure, the triangles and stars show the 
generalization capacity of the networks obtained with the criterion $\tilde{d}_1 > d_1$ and $n > 0$, and with merely the criterion 
$n > 0$, respectively. The horizontal axis is the mean percentage of the difference between an input vector and one of 
the samples. The vertical axis is the rate of correct classification. It is clear that the former network has much higher 
generalization capacity than the latter one. Note that an input vector with no correlation with any sample has equal 
probability to be classified into either category; a rate of 0.5 therefore indicates the total loss of the generalization capacity. 
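
A measurement of this kind can be sketched as follows. The sketch is a simplification under stated assumptions: each test vector differs from its sample in a fixed number of randomly chosen elements rather than in a mean percentage, and the toy network simply uses its own outputs as labels; the name `correct_rate` and all sizes are illustrative.

```python
import numpy as np

def step(x):
    return np.where(x > 0, 1, -1)

def correct_rate(J1, J2, xi, t, flip_fraction, trials=20, seed=0):
    """Fraction of corrupted samples still assigned to the category of the original sample."""
    rng = np.random.default_rng(seed)
    M, N = xi.shape
    n_flip = int(round(flip_fraction * N))       # number of elements to flip
    hits = 0
    for _ in range(trials):
        for mu in range(M):
            x = xi[mu].copy()
            k = rng.choice(N, size=n_flip, replace=False)
            x[k] *= -1                           # corrupt the sample
            v2 = step(step(J1 @ x) @ J2)         # two-layer network output
            hits += int(v2 == t[mu])
    return hits / (trials * M)

rng = np.random.default_rng(1)
N, N1, M = 41, 11, 30
J1 = rng.choice([-1, 1], size=(N1, N))
J2 = rng.choice([-1, 1], size=N1)
xi = rng.choice([-1, 1], size=(M, N))
t = step(step(xi @ J1.T) @ J2)                   # toy labels: the network's own outputs
rate_clean = correct_rate(J1, J2, xi, t, flip_fraction=0.0)
rate_noisy = correct_rate(J1, J2, xi, t, flip_fraction=0.2)
```

Plotting the returned rate against `flip_fraction` gives a curve of the same type as the one described for Fig. 2.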



By adopting the mean square error $\langle (t_{\mu} h_{\mu}^{(2)} - c)^2 \rangle$ as the performance index and the analog function $\sigma^{(1)}(x) = 
\tanh(x)$ as the transfer function for each neuron in the hidden layer, one can obtain a two-layer network capable of 
categorizing the same set of samples using the BP algorithm. To allow a comparison with our network, we 
apply $c = 34$ and stop the learning procedure once the condition $t_{\mu} h_{\mu}^{(2)} > 30$ is satisfied for all samples. The weights 
are normalized to satisfy $\langle J_{ij}^{(l)} \rangle = 1$. The dotted lines in Fig. 1 show the distributions of the LFNIIS for the BP 
network. It can be seen that $h_{\mu}^{(2)}$ distributes around $\pm c$ as two Gaussian-like peaks, and $h_{i\mu}^{(1)}$ distributes around the 
origin. Clearly, the distribution of the LFNIIS for the hidden layer is out of the control of the algorithm since $h_{i\mu}^{(1)}$ 
is not included in the performance index, and the two peaks in the distribution of $h_{\mu}^{(2)}$ are induced by the operation 
of minimizing the mean square error. The minimization drives the local fields with both smaller and larger values 
of $t_{\mu} h_{\mu}^{(2)}$ towards $c$ simultaneously, although the larger values are favorable for the 
generalization capacity as explained earlier. Thus, the generalization capacity of the BP network would be even worse 
than that of our network obtained with merely the criterion $n > 0$. The rate of correct generalization represented by 
the dots in Fig. 2 confirms this prediction. 

Our procedure can be directly extended to train neural networks with synaptic weights having more discrete states. 
It is easily seen that when the number of states is extended to infinity, e.g., $J_{ij}^{(l)} \in \{\pm 1, \pm 3, \pm 5, \dots\}$, the weights 
effectively become continuous (after being normalized). We have observed that the network performance can be further improved 
by increasing the number of discrete states. To show this, we compared the networks with $J_{ij}^{(1)} \in \{\pm 1\}$ and the networks 
with $J_{ij}^{(1)} \in \{\pm 1, \pm 3\}$ in terms of the maximum capacity of separating uncorrelated samples into two sets with 
equal members. The weights in the output layer are fixed at $J_j^{(2)} \in \{\pm 1\}$ for both networks. The results are shown in 
Fig. 3, where $M/N$ is the normalized maximum number of samples that can be separated into two sets correctly 
within $0.01 M N N^{(1)}$ repetitions of steps (2)-(3), and $N^{(1)}/N$ is the normalized number of neurons in the hidden 
layer. The up- and down-triangles represent the results for networks with $J_{ij}^{(1)} \in \{\pm 1\}$ and with $J_{ij}^{(1)} \in \{\pm 1, \pm 3\}$ 
respectively. In the calculation we fix $N = 500$. One can see from the figure that the maximum capacity increases with 
the number of neurons in the hidden layer, and with the number of discrete states of the weights. 

In summary, unlike the BP algorithm, the improved MCA algorithm puts no restriction on the neural transfer 
function and is applicable to training neural networks with either discrete or continuous synaptic weights. Another 
key difference is that we implement the desired network performance by controlling the distributions of the LFNIIS, 
while the BP algorithm approaches this goal by minimizing a performance index that is invariably defined as the mean 
square error. Controlling the distributions of the LFNIIS gives one much wider freedom to improve the network 
performance, because one can control not only the distribution for the output layer 
but also the distributions for the hidden layers. The good generalization capacity of the two-layer network 
trained with the criterion $\tilde{d}_1 > d_1$ and $n > 0$ results precisely from the control of the distribution of the LFNIIS for 
the hidden layer. 

The application of the algorithm described in this Letter to the problem of separating a set of samples into several 
categories is straightforward by involving more neurons in the output layer. The algorithm is directly applicable to 
train single-layer networks, and can be extended straightforwardly to train networks with three or more layers. 

We want to emphasize that the MCA algorithm is suitable for practical applications. For example, it takes 
about one hour of evolution time for a personal computer to train the network to satisfy the condition $t_{\mu} h_{\mu}^{(2)} > 30$ by 
applying the criterion $n > 0$. Fulfilling the same condition using the BP algorithm with the optimal learning rate takes 
about 6 computer hours. 

It is necessary to point out that the training procedure is sensitive to technical details. For example, if one replaces 
the criterion n > simply with n > in the related training procedures above, it may take twice the time to 
approach the same goals. On the other hand, certain treatments, such as adjusting the constant $c$ gradually to its 
target value, can dramatically decrease the training time. In addition, introducing a temperature into the criterion for 
acceptable adaptations can affect the efficiency of the training process in a complex way. These facts imply that there is 
considerable room to further improve the training procedure. 
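
One common way of introducing a temperature into an acceptance criterion is the Metropolis rule. The sketch below is an assumption about what such a criterion could look like, not the criterion used in this Letter; `accept`, `delta` and `temperature` are hypothetical names, and `delta` stands for the change of whatever quantity the training tries to increase.

```python
import math
import random

def accept(delta, temperature, rng):
    """Metropolis-style acceptance for a quantity that should increase.

    delta: change of the monitored quantity caused by a trial adaptation.
    temperature: non-negative number; zero recovers the strict criterion.
    rng: object with a random() method returning a float in [0, 1).
    """
    if delta >= 0:
        return True                       # improvements are always accepted
    if temperature <= 0:
        return False                      # zero temperature: reject any decrease
    # A decrease is accepted with probability exp(delta / temperature) < 1.
    return rng.random() < math.exp(delta / temperature)

rng = random.Random(0)
always = accept(1.0, 0.0, rng)            # improvement at zero temperature
never = accept(-1.0, 0.0, rng)            # decrease at zero temperature
```

At zero temperature the rule reduces to a one-way criterion, while a finite temperature occasionally accepts adaptations that move away from the target.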

Finally we briefly report an interesting phenomenon which may shed light on the role of different layers in a 
network. We have performed the MCA algorithm in two ways. One is to fix the weights in the output layer at some 
random realization and merely adjust the weights in the hidden layer. The other is to adjust the weights in both 
layers. It was found that both ways can achieve the same goal of classification, but the training time used in the first 
way was dramatically less than that used in the second way. This implies that the role of the output layer is merely 
to span out the space. Each specific realization of weights for the output layer has a set of optimal realizations of the 
weights for the hidden layer, and every realization leads to the same target distributions of $h_{i\mu}^{(1)}$ and $h_{\mu}^{(2)}$. 

Particular thanks are given to Professor Schuster from whom I got a lot of useful ideas and suggestions related to 
this work. This work is supported in part by the National Natural Science Foundation of China under Grant No. 
10475067, and the Doctor Education Fund of the Educational Department of China. 

FIGURE CAPTIONS 
Fig. 1. The distributions of the LFNIIS: (a) $h_{i\mu}^{(1)}$ and (b) $h_{\mu}^{(2)}$. 

Fig. 2. The generalization capacity of the networks. 

Fig. 3. The maximum capacity of classification as a function of the number of neurons in the hidden layer for $J_{ij}^{(1)} \in \{\pm 1\}$ 
and for $J_{ij}^{(1)} \in \{\pm 3, \pm 1\}$. 